
Asymptotic normality and confidence intervals for derivatives of 2-layers neural network in the random features model

Neural Information Processing Systems

This paper studies two-layer Neural Networks (NNs), where the first layer contains random weights and the second layer is trained using Ridge regularization. This model has been the focus of numerous recent works showing that, despite its simplicity, it captures some of the empirically observed behaviors of NNs in the overparametrized regime, such as the double-descent curve, where the generalization error decreases again as the number of weights increases to $+\infty$. This paper establishes asymptotic distribution results for this two-layer NN model in the regime where the ratios $\frac{p}{n}$ and $\frac{d}{n}$ have finite limits, where $n$ is the sample size, $p$ the ambient dimension and $d$ the width of the first layer. We show that a weighted average of the derivatives of the trained NN at the observed data is asymptotically normal, in a setting with Lipschitz activation functions and a linear regression response with Gaussian features under possibly non-linear perturbations. We then leverage this asymptotic normality result to construct confidence intervals (CIs) for single components of the unknown regression vector. The novelty of our results is threefold: (1) despite the nonlinearity induced by the activation function, we characterize the asymptotic distribution of a weighted average of the gradients of the network after training; (2) we provide the first frequentist uncertainty quantification guarantees, in the form of valid $(1-\alpha)$-CIs, based on NN estimates; (3) we show that the double-descent phenomenon also occurs in the length of the CIs, with the length increasing and then decreasing as $\frac{d}{n} \nearrow +\infty$ for certain fixed values of $\frac{p}{n}$. We also provide a toolbox to predict the length of the CIs numerically, which lets us compare activation functions and other parameters in terms of CI length.
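To make the setup concrete, here is a minimal sketch of the random features model described in the abstract: Gaussian features, a random untrained first layer W, a Lipschitz activation, and a second layer fitted by Ridge regression; the last lines compute the per-observation gradients of the trained network, a weighted average of which is the quantity the paper proves asymptotically normal. The numerical values, the tanh activation, and the noise level are illustrative assumptions, not the paper's choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, d, lam = 500, 200, 300, 1e-2           # sample size, ambient dim, first-layer width, ridge penalty

X = rng.standard_normal((n, p))              # Gaussian features
beta = rng.standard_normal(p) / np.sqrt(p)   # unknown regression vector
y = X @ beta + 0.1 * rng.standard_normal(n)  # linear response (nonlinear perturbation omitted here)

W = rng.standard_normal((d, p)) / np.sqrt(p) # first layer: random, untrained weights
Z = np.tanh(X @ W.T)                         # Lipschitz activation applied to random features (n x d)

# Second layer trained by Ridge regularization:
# a_hat = argmin_a (1/n) * ||y - Z a||^2 + lam * ||a||^2
a_hat = np.linalg.solve(Z.T @ Z / n + lam * np.eye(d), Z.T @ y / n)

# Gradient of the trained network f(x) = a_hat^T tanh(W x) at each observation x_i,
# via the chain rule: grad f(x_i) = W^T diag(tanh'(W x_i)) a_hat.
act_deriv = 1.0 - Z ** 2                     # tanh'(X W^T), shape n x d
gradients = (act_deriv * a_hat) @ W          # n x p; row i is grad f(x_i)
```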



Review for NeurIPS paper: Asymptotic normality and confidence intervals for derivatives of 2-layers neural network in the random features model

Neural Information Processing Systems

The reviewers point out that this is a borderline submission. They reasonably question several things in the paper:
- it is not clear why the coefficients for which the CLT holds are important;
- the assumptions are restrictive;
- the model studied is too simplistic;
- parts of the analysis are unclear;
- the writing is hasty, with lingering typos.
After my own reading, I agree with these comments. On the other hand, the reviewers also point out that certain aspects of double descent not previously explored are covered here, and these are of more interest than the confidence intervals. My opinion is that the paper would be much stronger if these shortcomings were addressed in a revised manuscript.


Review for NeurIPS paper: Asymptotic normality and confidence intervals for derivatives of 2-layers neural network in the random features model

Neural Information Processing Systems

Additional Feedback: I will increase my score if my concerns are addressed and if the authors could correct my potential misunderstanding.
1. I find the "double descent" phenomenon in the CI length to be interesting. Intuitively, the uncertainty of the model could relate to the variance of the prediction, which we know might blow up at the interpolation threshold due to the variance from label noise or from initialization. Can the authors comment on the plausible mechanism behind this observation?
2. In this setting, what would be the motivation for considering a nonlinear perturbation, which would basically amount to adding noise?
3. The result in Section 2.4 (based on Mei and Montanari 2019) seems to be under the assumption of an i.i.d. weight matrix W. I might have missed something, but is there a place where the authors discuss whether this characterization also holds for an arbitrary W (independent of X) with bounded spectral norm?
4. (minor) Does the characterization also hold in the ridgeless limit ($\lambda \to 0$)?
5. (minor) In Figure 2 (left), why is there a discrepancy between the predicted and the simulated boxplots?
6. (minor) Although this is not the motivation of the work, the mentioned connection between NNs and the RF model typically requires significant overparameterization, and thus the current proportional scaling of n and d might not be the right setup.

